JianNan ZHANG JiJun ZHOU JianFeng WU ShengYing YANG
Convolutional neural networks (CNNs) have a strong ability to understand and judge images. However, the enormous number of parameters and the computational cost of CNNs have limited their application in resource-limited devices. In this letter, we used the ideas of parameter sharing and dense connection to compress the parameters along the channel direction of the convolution kernels, thus greatly reducing the number of model parameters. On this basis, we designed Shared and Dense Channel-wise Convolutional Networks (SDChannelNets), mainly composed of depth-wise separable SD-Channel-wise convolution layers. The advantage of SDChannelNets is that the number of model parameters is greatly reduced with little or no loss of accuracy. We also introduced a hyperparameter that can effectively balance the number of parameters against the accuracy of a model. We evaluated the proposed model on two popular image recognition tasks (CIFAR-10 and CIFAR-100). The results showed that SDChannelNets achieved accuracy similar to that of other CNNs while greatly reducing the number of parameters.
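The SD-Channel-wise layer is not detailed in this abstract, but a minimal PyTorch sketch of the underlying idea, sharing one set of point-wise weights along the channel direction inside a depth-wise separable block, might look as follows (the class name, the sharing scheme, and the per-output scaling are all assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedChannelBlock(nn.Module):
    """Depth-wise separable block whose point-wise stage reuses a single
    1 x in_ch filter for every output channel (a hypothetical reading of
    'parameter sharing in the channel direction')."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Depth-wise stage: one spatial filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Shared point-wise weights, plus a cheap per-output scale.
        self.shared = nn.Parameter(torch.randn(1, in_ch, 1, 1))
        self.scale = nn.Parameter(torch.ones(out_ch))

    def forward(self, x):
        x = self.depthwise(x)
        y = F.conv2d(x, self.shared)              # (N, 1, H, W)
        return y * self.scale.view(1, -1, 1, 1)   # broadcast to out_ch maps
```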
Naranchimeg BOLD Chao ZHANG Takuya AKASHI
In the recent decade, many state-of-the-art algorithms for image classification as well as audio classification have achieved noticeable success with the development of deep convolutional neural networks (CNNs). However, most of these works exploit only a single type of training data. In this paper, we present a study on classifying bird species by exploiting the combination of both visual (image) and audio (sound) data using CNNs, which has been sparsely treated so far. Specifically, we propose CNN-based multimodal learning models with three types of fusion strategies (early, middle, late) to settle the issue of combining training data across domains. The advantage of our proposed method lies in the fact that CNNs can be used not only to extract features from image and audio data (spectrograms) but also to combine the features across modalities. In the experiment, we train and evaluate the network structures on the comprehensive CUB-200-2011 standard dataset combined with our originally collected audio dataset, matched by species. We observe that a model which utilizes the combination of both data types outperforms models trained with only either type of data. We also show that transfer learning can significantly increase classification performance.
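As an illustration of the middle-fusion strategy, a minimal sketch with two CNN branches whose features are concatenated before a joint classifier could look like this (branch depths and layer sizes are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class MiddleFusionNet(nn.Module):
    """Illustrative middle fusion: one CNN branch per modality, with
    the feature vectors concatenated before a joint classifier."""
    def __init__(self, num_classes=200):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.image_branch = branch()     # RGB image
        self.audio_branch = branch()     # spectrogram rendered as an image
        self.classifier = nn.Linear(64 * 2, num_classes)

    def forward(self, image, spectrogram):
        f = torch.cat([self.image_branch(image),
                       self.audio_branch(spectrogram)], dim=1)
        return self.classifier(f)
```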
Meng Ting XIONG Yong FENG Ting WU Jia Xing SHANG Bao Hua QIANG Ya Nan WANG
A traditional recommendation system (RS) can learn the latent personal preferences of users and the latent attributes of items from the rating records between users and items in order to make recommendations. However, for new items with no historical rating records, a traditional RS usually suffers from the typical item cold start problem. Additional auxiliary information has often been used for item cold start recommendation; we further bring temporal dynamics, text, and item correlation into our models to alleviate item cold start. We propose two new cold start recommendation models, TmTx (Time, Text) and TmTI (Time, Text, Item correlation), for different cold start scenarios. Well-known methods such as TimeSVD++ and CoFactor partially take temporal dynamics, comments, and item correlations into consideration, but none of them combines these sources of information together. The two models proposed in this paper fuse features such as time, text, and item correlation and can effectively improve performance under item cold start. We select a convolutional neural network (CNN) to extract features from item description text, which gives the models the ability to deal with cold start items. Experimental results on three real-world datasets show that our proposed models lead to significant improvements over the baseline methods.
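A minimal sketch of the text-feature extractor described above, a CNN over the item description, might look like the following (vocabulary size, embedding width, and filter sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal text-CNN producing a feature vector from an item's
    description; hyperparameters here are illustrative only."""
    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=50,
                 kernel_sizes=(3, 4, 5), feature_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.proj = nn.Linear(num_filters * len(kernel_sizes), feature_dim)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed, seq)
        # Convolve with several window sizes, then max-pool over time.
        pooled = [c(x).relu().max(dim=2).values for c in self.convs]
        return self.proj(torch.cat(pooled, dim=1))   # item feature vector
```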
Di YANG Songjiang LI Zhou PENG Peng WANG Junhui WANG Huamin YANG
Accurate traffic flow prediction is a precondition for many applications in Intelligent Transportation Systems, such as traffic control and route guidance. Traditional data-driven traffic flow prediction models tend to ignore traffic self-features (e.g., periodicities) and commonly suffer from the shifts brought by various complex factors (e.g., weather and holidays), which reduce the precision and robustness of the prediction models. To tackle this problem, in this paper we propose a CNN-based multi-feature predictive model (MF-CNN) that collectively predicts network-scale traffic flow with multiple spatiotemporal features and external factors (weather and holidays). Specifically, we classify traffic self-features into temporal continuity as a short-term feature, and daily and weekly periodicity as long-term features, and then map them to three two-dimensional spaces, each composed of time and space and represented by a two-dimensional matrix. The high-level spatiotemporal features learned by CNNs from the matrices with different time lags are further fused with external factors by a logistic regression layer to derive the final prediction. Experimental results indicate that the MF-CNN model, by considering multiple features, improves predictive performance over five baseline models and achieves a trade-off between accuracy and efficiency.
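A rough sketch of this design, three CNN streams over the short-term, daily, and weekly time-space matrices, fused with external factors by a final linear (logistic-regression-style) layer, could look like this (all sizes are assumptions):

```python
import torch
import torch.nn as nn

class MFCNNSketch(nn.Module):
    """Sketch in the spirit of MF-CNN: one CNN stream per time-space
    matrix, late-fused with external factors.  Sizes are illustrative."""
    def __init__(self, num_links=64, ext_dim=4):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.streams = nn.ModuleList(stream() for _ in range(3))
        self.fuse = nn.Linear(16 * 3 + ext_dim, num_links)

    def forward(self, recent, daily, weekly, external):
        # Each matrix input has shape (batch, 1, num_links, time_lags).
        feats = [s(x) for s, x in zip(self.streams, (recent, daily, weekly))]
        return self.fuse(torch.cat(feats + [external], dim=1))
```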
Yan CHEN Jing ZHANG Yuebing XU Yingjie ZHANG Renyuan ZHANG Yasuhiko NAKASHIMA
An efficient resistive random access memory (ReRAM) structure is developed for accelerating convolutional neural networks (CNNs) through in-memory computation. A novel ReRAM cell circuit is designed with two-directional (2-D) accessibility. The entire memory system is organized as a 2-D array, in which specific memory cells can be identically accessed along both the column and row directions. For the in-memory computation of CNNs, only the relevant cells in an identical sub-array are accessed by 2-D read-out operations, which can hardly be implemented with conventional ReRAM cells. In this manner, the redundant (column or row) accesses of conventional ReRAM structures are avoided, eliminating unnecessary data movement when CNNs are processed in memory. Simulation results show that the energy and bandwidth efficiency of the proposed memory structure are 1.4x and 5x those of a state-of-the-art ReRAM architecture, respectively.
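A toy NumPy illustration of why 2-D accessibility matters: a convolution window needs only a sub-array, but a row-only memory must fetch entire rows (the array size and window here are arbitrary stand-ins):

```python
import numpy as np

mem = np.arange(64).reshape(8, 8)      # stand-in for a ReRAM array
window = (slice(2, 5), slice(3, 6))    # 3x3 region a CNN layer needs

row_only_fetch = mem[window[0], :]     # 3 full rows -> 24 cells moved
two_d_fetch = mem[window]              # exactly the 9 relevant cells

print(row_only_fetch.size, two_d_fetch.size)   # 24 vs. 9
```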
Ruicong ZHI Hairui XU Ming WAN Tingting LI
Facial micro-expressions are momentary and subtle facial reactions, and it is still challenging to automatically recognize them with high accuracy in practical applications. Extracting spatiotemporal features from facial image sequences is essential for facial micro-expression recognition. In this paper, we employed 3D Convolutional Neural Networks (3D-CNNs) for self-learned feature extraction to represent facial micro-expressions effectively, since 3D-CNNs can extract spatiotemporal features from facial image sequences well. Moreover, transfer learning was utilized to deal with the problem of insufficient samples in facial micro-expression databases. We first pre-trained the 3D-CNNs on the normal facial expression database Oulu-CASIA by supervised learning, and then the pre-trained model was effectively transferred to the target domain, the facial micro-expression recognition task. The proposed method was evaluated on two available facial micro-expression datasets, i.e., CASME II and SMIC-HS. We obtained overall accuracies of 97.6% on CASME II and 97.4% on SMIC-HS, which were 3.4% and 1.6% higher, respectively, than the 3D-CNNs model without transfer learning. The experimental results demonstrate that our method achieves superior performance compared to state-of-the-art methods.
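A minimal sketch of the transfer-learning recipe, pre-train a 3D-CNN on a normal expression task and then swap the classification head for micro-expression classes, might look like this (the layer sizes and class counts are assumptions):

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Illustrative 3D-CNN over facial image sequences; not the exact
    architecture used in the paper."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip):               # (batch, 1, frames, H, W)
        return self.classifier(self.features(clip))

# Pre-train on a normal expression task, then replace the head for the
# micro-expression target domain (class counts here are hypothetical).
model = Simple3DCNN(num_classes=6)         # e.g., Oulu-CASIA expressions
# ... supervised pre-training would happen here ...
model.classifier = nn.Linear(32, 5)        # micro-expression classes
```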
Suofei ZHANG Bin KANG Lin ZHOU
Instance-feature-based deep learning methods boost the performance of high-speed object tracking systems by directly comparing the target with its template during training and tracking. However, from the perspective of the human vision system, prior knowledge of the target also plays a key role during tracking. To integrate both semantic knowledge and instance features, we propose a convolutional network based object tracking framework that simultaneously outputs bounding boxes based on different prior knowledge as well as the confidences of the corresponding assumptions. Experimental results show that our proposed approach achieves both higher accuracy and higher efficiency than other leading methods on tracking tasks covering most daily objects.
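As a sketch of the multi-output design, a head that emits one bounding box and one confidence per prior assumption could look like the following (the backbone, feature size, and number of priors are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class MultiPriorHead(nn.Module):
    """Toy head: one bounding box and one confidence per prior."""
    def __init__(self, num_priors=3):
        super().__init__()
        self.num_priors = num_priors
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.boxes = nn.Linear(32, 4 * num_priors)  # (x, y, w, h) per prior
        self.confs = nn.Linear(32, num_priors)      # one confidence each

    def forward(self, frame):
        f = self.backbone(frame)
        return (self.boxes(f).view(-1, self.num_priors, 4),
                self.confs(f).sigmoid())
```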
Yundong LI Weigang ZHAO Xueyan ZHANG Qichen ZHOU
Crack detection is a vital task for maintaining a bridge's health and safety. Traditional computer-vision-based methods easily suffer from the disturbance of noise and clutter in real bridge inspections. To address this limitation, we propose a two-stage crack detection approach based on Convolutional Neural Networks (CNNs) in this letter. A predictor with a small receptive field is exploited in the first detection stage, while another predictor with a large receptive field is used to refine the detection results in the second stage. Benefiting from the fusion of the confidence maps produced by both predictors, our method can accurately predict the probability that each pixel belongs to a cracked area. Experimental results show that the proposed method outperforms an up-to-date method on real concrete surface images.
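A minimal sketch of the two-stage idea, two fully convolutional predictors with different receptive fields whose confidence maps are fused, might look like this (dilation stands in for the receptive-field difference, and averaging is just one simple fusion choice):

```python
import torch
import torch.nn as nn

def predictor(dilation):
    """Toy fully convolutional predictor emitting a per-pixel crack
    confidence map; dilation widens the receptive field."""
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=dilation, dilation=dilation), nn.ReLU(),
        nn.Conv2d(16, 1, 1), nn.Sigmoid())

small_rf = predictor(dilation=1)   # stage 1: small receptive field
large_rf = predictor(dilation=4)   # stage 2: large receptive field

image = torch.rand(1, 3, 128, 128)
# Fuse the two confidence maps into a per-pixel crack probability.
crack_prob = 0.5 * (small_rf(image) + large_rf(image))
```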
Zhengxue CHENG Masaru TAKEUCHI Kenji KANAI Jiro KATTO
Image quality assessment (IQA) is a fundamental problem in the field of image processing. Recently, deep learning-based image quality assessment has attracted increased attention owing to its high prediction accuracy. In this paper, we propose a fully-blind and fast image quality predictor (FFIQP) using convolutional neural networks, with two strategies. First, we propose a distortion clustering strategy based on the distribution function of intermediate-layer results in the convolutional neural network (CNN) to make IQA fully blind. Second, by analyzing the relationship between image saliency information and CNN prediction error, we utilize a pre-saliency map to skip the non-salient patches for IQA acceleration. Experimental results verify that our method achieves high accuracy (0.978) against subjective quality scores, outperforming existing IQA methods. Moreover, the proposed method is computationally appealing, achieving flexible complexity-performance trade-offs by assigning different thresholds in the saliency map.
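The patch-skipping acceleration could be sketched as follows (the patch size, threshold, and random stand-in saliency map are assumptions):

```python
import numpy as np

def salient_patches(saliency, patch=32, thresh=0.2):
    """Keep only patches whose mean saliency exceeds a threshold; the
    skipped patches never reach the CNN, saving computation."""
    h, w = saliency.shape
    keep = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if saliency[y:y + patch, x:x + patch].mean() > thresh:
                keep.append((y, x))          # this patch goes to the CNN
    return keep

saliency_map = np.random.rand(256, 256)      # stand-in pre-saliency map
print(len(salient_patches(saliency_map)))    # number of patches scored
```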
Jinhua WANG Weiqiang WANG Guangmei XU Hongzhe LIU
In this paper, we describe the direct learning of an end-to-end mapping between under-/over-exposed images and well-exposed images. The mapping is represented as a deep convolutional neural network (CNN) that takes multiple-exposure images as input and outputs a high-quality image. Our CNN has a lightweight structure, yet gives state-of-the-art fusion quality. Furthermore, for a given pixel, the influence of the surrounding pixels increases as their distance to it decreases; if the only pixels considered are those within the convolution kernel's neighborhood, useful context is lost and the final result suffers. To overcome this problem, the size of the convolution kernel is often increased, but this also increases the complexity of the network (too many parameters) and the training time. In this paper, we present a method in which a number of sub-images of the source image are obtained and processed with the same CNN model, providing more neighborhood information for the convolution operation. Experimental results demonstrate that the proposed method achieves better performance in terms of both objective evaluation and visual quality.
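One plausible sub-image decomposition, sampling the source with a stride so that a fixed kernel effectively sees a wider neighborhood, can be sketched as follows (the exact scheme in the paper may differ):

```python
import torch

def to_subimages(image, stride=2):
    """Split an image into stride**2 strided sub-images, each covering
    the full field of view at lower resolution."""
    subs = [image[..., i::stride, j::stride]
            for i in range(stride) for j in range(stride)]
    return torch.stack(subs, dim=0)   # (stride**2, C, H/stride, W/stride)

img = torch.rand(3, 64, 64)
print(to_subimages(img).shape)        # torch.Size([4, 3, 32, 32])
```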
Yulong XU Yang LI Jiabao WANG Zhuang MIAO Hang LI Yafei ZHANG
The feature extractor plays an important role in visual tracking, but most state-of-the-art methods employ the same feature representation in all scenes. Taking the diversity of scenes into account, a tracker should choose different features according to the video. In this work, we propose a novel feature-adaptive correlation tracker that decomposes the tracking task into translation and scale estimation. According to the luminance of the target, our approach automatically selects either hierarchical convolutional features or histogram of oriented gradients (HOG) features for translation estimation in varied scenarios. Furthermore, we employ a discriminative correlation filter to handle scale variations. Extensive experiments are performed on a large-scale, challenging benchmark dataset, and the results show that the proposed algorithm outperforms state-of-the-art trackers in accuracy and robustness.
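The luminance-driven feature selection could be as simple as the following sketch (the threshold and the direction of the rule are assumptions):

```python
import numpy as np

def select_feature(target_patch, lum_thresh=40.0):
    """Pick a feature type from the target's mean luminance; dark
    targets fall back to HOG, brighter ones use CNN features."""
    return "cnn_hierarchical" if target_patch.mean() > lum_thresh else "hog"

patch = np.random.randint(0, 255, (32, 32)).astype(np.float32)
print(select_feature(patch))
```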
Shinya OHTANI Yu KATO Nobutaka KUROKI Tetsuya HIROSE Masahiro NUMA
This paper proposes image super-resolution techniques with multi-channel convolutional neural networks. In the proposed method, output pixels are classified into K×K groups depending on their coordinates. Those groups are generated from separate channels of a convolutional neural network (CNN) and finally synthesized into a K×K magnified image. This architecture can enlarge images directly, without bicubic interpolation. Experimental results for 2×2, 3×3, and 4×4 magnification show that the average PSNR of the proposed method is about 0.2 dB higher than that of the conventional SRCNN.
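Interleaving K×K coordinate groups emitted as separate channels coincides with the sub-pixel convolution (PixelShuffle) operation; a minimal sketch, with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class MultiChannelSR(nn.Module):
    """The CNN emits K*K output channels, one per sub-pixel position,
    which are interleaved into the K-times magnified image."""
    def __init__(self, k=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, k * k, 3, padding=1))
        self.shuffle = nn.PixelShuffle(k)    # interleave the K*K groups

    def forward(self, low_res):                  # (batch, 1, H, W)
        return self.shuffle(self.body(low_res))  # (batch, 1, K*H, K*W)
```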
Recent studies have obtained superior performance in image recognition tasks by using, as an image representation, the fully connected layer activations of Convolutional Neural Networks (CNNs) trained on various kinds of images. However, the CNN representation is not very suitable for fine-grained image recognition tasks such as food image recognition. To improve the performance of the CNN representation in food image recognition, we propose a novel image representation comprised of the covariances of convolutional layer feature maps. In an experiment on the ETHZ Food-101 dataset, our method achieved 58.65% averaged accuracy, which outperforms previous methods such as the Bag-of-Visual-Words Histogram, the Improved Fisher Vector, and CNN-SVM.
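The covariance representation itself is straightforward to compute from a convolutional feature map; a minimal sketch (the feature-map size is an arbitrary example):

```python
import torch

def featuremap_covariance(fmap):
    """Channel covariance of a conv feature map: treat each channel as
    a variable observed at every spatial position."""
    c, h, w = fmap.shape
    x = fmap.reshape(c, h * w)               # channels x spatial positions
    x = x - x.mean(dim=1, keepdim=True)      # center each channel
    return (x @ x.T) / (h * w - 1)           # (C, C) covariance matrix

fmap = torch.rand(64, 14, 14)                # e.g., a conv-layer output
print(featuremap_covariance(fmap).shape)     # torch.Size([64, 64])
```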